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Abstract —This paper investigates one of the most fundamental 
computer vision problems: image segmentation. We propose a 
supervised hierarchical approach to object-independent image 
segmentation. Starting with over-segmenting superpixels, we use 
a tree structure to represent the hierarchy of region merging, 
by which we reduce the problem of segmenting image regions to 
finding a set of label assignment to tree nodes. We formulate the 
tree structure as a constrained conditional model to associate 
region merging with likelihoods predicted using an ensemble 
boundary classifier. Final segmentations can then be inferred by 
finding globally optimal solutions to the model efficiently. We also 
present an iterative training and testing algorithm that generates 
various tree structures and combines them to emphasize accurate 
boundaries by segmentation accumulation. Experiment results 
and comparisons with other recent methods on six public data 
sets demonstrate that our approach achieves state-of-the-art 
region accuracy and is competitive in image segmentation without 
semantic priors. 

Index Terms —Image segmentation, hierarchical merge tree, 
constrained conditional model, supervised classification, object- 
independent, ensemble model. 


1. Introduction 

I MAGE segmentation is an important mid-level computer 
vision problem that has been studied for a long time yet 
remains challenging. General image segmentation is used as 
a pre-processing step for solving high-level vision problems, 
such as object recognition and image classification. In many 
inter-disciplinary areas, e.g., biological and medical imaging, 
image segmentation also plays a significant role in helping 
scientists quantify and analyze image data. While a lot of 
research has been done to achieve high segmentation accuracy 
for specific types of images, the quality of image segmentation 
for general scenes is still less than satisfactory. 

In this paper, we introduce a supervised learning based im¬ 
age segmentation framework, namely, the hierarchical merge 
tree model. Starting with over-segmenting superpixels, we 
propose to represent the region merging hierarchy with a tree¬ 
like constrained conditional model. An ensemble boundary 
classifier is trained to score each factor in the graphical 
model. A globally optimal label assignment to the model is 
computed by minimizing the total energy under the region 
consistency constraint, and a final segmentation is recovered 
from the labeling. We also propose an iterative approach that 
generates various region merging hierarchies and combines 
them to improve the overall performance via segmentation 
accumulation. We conduct extensive experiments for empirical 
validation. Comparisons with other very recent methods on 

T. Liu, M. Seyedhosseini, and T. Tasdizen are with the Scientific Computing 
and Imaging Institute, University of Utah, Salt Lake City, UT 84112, USA. 
E-mail: {ting,mseyed,tolga} @sci.utah.edu 

T. Liu is also with the School of Computing, University of Utah. 

M. Seyedhosseini and T. Tasdizen are also with the Department of Electrical 
and Computer Engineering, University of Utah. 


six public data sets show that our proposed method produces 
state-of-the-art region segmentation results. 

We begin with a review of previous related work on general 
image segmentation methods in Section |n] In Section |m| we 
illustrate our hierarchical merge tree as a constrained condi¬ 
tional model and introduce the ensemble boundary classifier. 
In Section |lv| we describe a modification to the hierarchical 
merge tree model with iterative segmentation accumulation. 
Experimental results are shown in Section [V| in which we 
compare the segmentation performance of our method with 
different settings, as well as with other recent state-of-the-art 
methods. In Section |Vl| we summarize our current work and 
discuss the possible improvement for the future. 


II. Previous work 

There are two different perspectives of image segmenta¬ 
tion m. One is edge detection, which aims at finding edges 
between different perceptual pixel groups. The other one is 
region segmentation, which partitions an image into disjoint 
regions. Usually, edge detection focuses on assigning a binary 
label to each pixel with certain confidence indicating if it 
belongs to an edge or not and does not guarantee closed 
object contours. Though closed contours and thus regions they 
encircle can be recovered from edges, such transformation 
with high accuracy is usually non-trivial. On the other hand, 
region segmentation seeks to find the cluster membership of 
each pixel, and closed contours of an object can be trivially 
generated as the outmost points of a region. Many region seg¬ 
mentation methods also take advantage of the edge detection 
outputs as boundary cues to help with the search for correct 
partitioning. Our method belongs to the region segmentation 
category, and in this section we emphasize reviewing previous 
related works in this category. 

Eirst, we briefiy summarize related edge detection works. 
Early edge detections are mostly based on image deriva¬ 
tives m, |3) or filter banks responses a, 0 . More recent 
works utilize richer information such as colors and textures. 
One of the most notable works, gPb HI, combines multi-scale 
local cues and globalized cues via spectral clustering and sets 
up a benchmark for edge detection and region segmentation 
research. Taking advantage of supervised learning techniques 
has also become the recent trend in edge detection. Ren 
and Bo 0 train a classifier with sparse codes on local 
neighborhood information and improve the edge detection 
performance. Dollar and Zitnick 171 propose a structured 
learning framework using modified random decision forest 
for efficient edge detection. Seyedhosseini and Tasdizen 111 
propose a hierarchical model to capture multi-scale contextual 
information and achieve state-of-the-art edge detection perfor¬ 
mance. 
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Early works on region segmentation seek to directly group 
image pixels in an unsupervised manner. Belongie et al ||9l 
fit Gaussian mixture models to cluster pixels based on six¬ 
dimensional color and texture features. Mean shift Co) and 
its variant im consider region segmentation as a density 
mode searching problem. A number of works belong to graph 
partitioning category, which regards an image as a graph with 
pixels being nodes and edge weights indicating dissimilarity 
between neighbor pixels. Normalized cuts ifT^ takes the image 
affinity matrix and partitions an image by solving eigenvalue 
problems. Felzenszwalb and Huttenlocher ca propose to 
greedily merge two connected components if there exists an 
inter-component edge weight that is less than the largest edge 
weights in the minimum spanning trees of both components. 
Arbelaez et al. propose a variant of watershed transform 
to generate a hierarchy of closed contours. We refer readers 
to IhI for a comprehensive review of existing methods. 

As in edge detection, supervised learning based methods 
for region segmentation have gained increased popularity in 
recent years. This trend leads to and is further promoted by 
a number of publicly available computer vision data sets with 
human-labeled ground truth ifTSll . Q, (161, (TTl - HSI, (191 . 
Though unsupervised methods, such as CqI, ED, are shown 
to generate perceptually coherent segmentations, learning seg¬ 
mentation models from supervised data enables much more 
capability and fiexibility of incorporating preference from 
human observers and leads to many more interesting works. 

Following the classic foreground/background segmentation, 
object-independent segmentation methods seek to partition 
an image based only on its appearance and do not utilize 
underlying semantics about the scene or specific information 
about target objects. Kim et al propose a hypergraph-based 
correlation clustering framework f22\ that uses structured 
SVM for learning the structural information from training 
data. Arbelaez et al develop the multi-scale combinatorial 
grouping (MCG) framework (^ that exploits multi-scale 
information and uses a fast normalized cuts algorithm for 
region segmentation. Yu et al present a piecewise fiat 
embedding learning algorithm and report the best published 
results so far on Berkeley Segmentation Data Set using the 
MCG framework. Two other recent superpixel-merging ap¬ 
proaches are ISCRA (25l and GAFA (26l. Starting with a fine 
superpixel over-segmentation, ISCRA adaptively divides the 
whole region merging process into different cascaded stages 
and trains a respective logistic regression model at each stage 
to determine the greedy merging. Meanwhile, GAFA improves 
the boundary classifier training by augmenting the training 
set via repeatedly iterating through the merging process. 
Moreover, impressive results in the extensive evaluations on 
six public segmentation data sets are reported in (25l. 

Object-dependent or semantic segmentation is another 
branch of region segmentation. Object-dependent prior knowl¬ 
edge is exploited to guide or improve the segmentation pro¬ 
cess. Borenstein and Ullman EH formulate object segmen¬ 
tation as a joint model that uses both low-level visual cues 
and high-level object class information. Some other object 
segmentation methods first generate object segmentation hy¬ 
potheses using low-/mid-level features and then rank segments 


with high-level prior knowledge (28l, 1291 . A recent work, 
SCAFPEF (sol , incorporates high-level information in the 
segmentation process and can generate object proposals more 
efficiently and accurately. There are also a group of methods, 
called co-segmentation, that utilizes the homogeneity between 
different target objects and jointly segments multiple images 
simultaneously ED, E2, E3 

Our method falls into the object-independent hierarchical 
segmentation category. A preliminary version of our method 
with the merge tree model and a greedy inference algorithm 
appeared in Ea, ES and was only applied to segmenting 
electron microscopy images, apart from which the contribu¬ 
tions of this paper include: 

• Reformulation of the hierarchical merge tree as a con¬ 
strained conditional model with globally optimal solu¬ 
tions defined and an efficient inference algorithm devel¬ 
oped, instead of the greedy tree model in (341 . (21. 

• An iterative approach to diversify merge tree generation 
and improve results via segmentation accumulation. 

• Experiments that extensively compare different variants 
and settings of the hierarchical merge tree model and 
show the robustness of the proposed approach against 
image noise at testing time. 

• Experiments with state-of-the-art results on six public 
data sets for general image segmentation. 

Compared with recent competitive hierarchical segmenta¬ 
tion methods, ISCRA (^ and GAFA (21, which use a 
threshold-based greedy region merging strategy, our hierar¬ 
chical merge tree model has two major advantages. First, the 
tree structure enables the incorporation of higher order image 
information into segmentation. The merge/split decisions are 
made together in a globally optimal manner instead of by 
looking only at local region pairs. Second, our method does 
not require the threshold parameter to determine when to stop 
merging as in ISCRA and GAFA, which may be so important 
to the results that needs carefully tuning. Furthermore, our 
method is almost parameter-free given the initial superpixel 
over-segmentation. The only parameter is the number of 
iterations, which can be fixed as shown in the experiments 
on all the data sets. 

III. Hierarchical merge tree model 

Given an image / consisting of pixels V, a segmentation 
is a partition of V, denoted as S' = {si G 2^ | 13^5^ = 
V]\/i 7 ^ j^Si n Sj = 0}, where 2^ is the power set of V. 
A segmentation assigns every pixel an integer label that is 
unique for each image object. Each Si, which is a connected 
subset of pixels in V, is called a segment or region. All 
possible partitions form a segmentation space S^p. A ground 
truth segmentation Sg G Sp is usually generated by humans 
and considered as the gold standard. The accuracy of a 
segmentation S is measured based on its agreement with 
5'^. In a probabilistic setting, solving a segmentation problem 
is formulated as finding a segmentation that maximizes its 
posterior probability given the image as 

S'* = argmax P(5'I/). (1) 

seSv 
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The current trend to alleviate the difficulty in pixelwise 
search for S'* is to start with a set of over-segmenting 
superpixels. A superpixel is an image segment consisting of 
pixels that have similar visual characteristics. A number of 
algorithms O. lUSll. ll^. IJTII. ll3^ can be used to generate 
superpixels. In this paper, we use the watershed algorithm 13^ 
over the output of the boundary detector gPb m 

Let So be the initial over-segmentation given by the su¬ 
perpixels, the final segmentation consisting only of merged 
superpixels in So can be represented sls S = {si G 2^ | = 

7 ^ j,Si nsj = 0;Vz,3S" e 2'^“,s.t. Si = Us/gs/s'}. 
Therefore, the search space for S is largely reduced to 5 C 
tSp. Even so, however, exhaustive search is still intractable, 
and some kind of heuristic has to be injected. We propose 
to further limit 5 to a set of segmentations induced by tree 
structures and make the optimum search feasible. 

A. Hierarchical merge tree 

Consider a graph, in which each node corresponds to a 
superpixel, and an edge is defined between two nodes that 
share boundary pixels with each other. Starting with the 
initial over-segmentation So, finding a final segmentation, 
which is essentially the merging of initial superpixels, can be 
considered as combining nodes and removing edges between 
them. This superpixel merging can be done in an iterative 
fashion: each time a pair of neighboring nodes are combined in 
the graph, and corresponding edges are updated. To represent 
the order of such merging, we use a full binary tree structure, 
which we call the hierarchical merge tree (or merge tree for 
short) throughout this paper. In a merge tree Tr = (V, f), 
a node vf ^ V represents an image segment si G 2^, 
where d denotes the depth in Tr at which this node occurs. 
Leaf nodes correspond to initial superpixels in So. A non¬ 
leaf node corresponds to an image region formed by merging 
superpixels, and the root node corresponds to the whole image 
as one single region. An edge Cij G £ between node vf and 
its child exists when Sj C and a local structure 

represents Si = Sj U Sk. In this 
way, finding a final segmentation becomes finding a subset of 
nodes in Tr. Fig. \Tc\ shows a merge tree example with initial 
superpixels shown in Fig. [T^ corresponding to the leaf nodes. 
The non-leaf nodes represent image regions as combinations of 
initial superpixels. Fig. shows a final segmentation formed 
by a subset of tree nodes. It is noteworthy that a merge tree 
defined here can be seen as a dendrogram in hierarchical 
clustering ioi with each cluster being an image region. 

In order to determine the merging priority, we define a 
merging saliency function f^s : 5'^ ^ M that assigns a 
real number to each pair of regions in S' as a measurement 
of their merging likelihood. For any pair of regions Si and 
Sj that are not neighbors, we define fms{si,Sj) = —oc. 
Then starting from a set of initial superpixels S = So as 
leaf nodes, a merge tree is constructed by iteratively merging 
{s*,s*) = a-rgmaxg.(ZS,ijij fms{si,Sj) to a parent node, 
until only one region remains in S corresponding to the root 
node. Statistics over the strengths of boundary pixels between 
two merging regions from boundary detection probability 


maps may be used as fms- Following Ea, we use negated 
median 

fms{si,Sj) = 1 - median({P6(A:) | k e B{si,Sj)}), (2) 

where Pb{k) is the value of the k-th pixel on some boundary 
detection probability map Pb, and B{si,Sj) is the set of 
boundary pixels between Si and Sj. B can be different on 
implementation. In our work, we define 

B{si, Sj) = {si r]Af{sj)) U {sj r]Af{si )), (3) 

where A/'(s.) is the set of neighbor pixels of s.. We also 
propose to learn the merging saliency function from data in 
Section HV] 


B. Constrained conditional model 


In order to select a subset of nodes that forms an optimal 
segmentation, we formulate the merge tree as a constrained 
conditional model. It is essentially a factor graph for the 
merge tree, in which the node set aligns identically with V, 
and each merge in the merge tree that involves three nodes 




j^Cik}) is considered as a clique pi in 


the factor graph. A label = +1 or — 1 is assigned to 
each node indicating whether its children merge or not. All 
leaf nodes must be labeled +1. A complete label assignment 
Y = {yi}i of all nodes must also be subject to the region 
consistency constraint that if a node is labeled +1, all of 
its descendants must be labeled +1 as well. Then the nodes 
whose labels are +1 and parents are labeled —1 are selected 


as segments in the final segmentation. Fig. is the factor 
graph for the constrained conditional model derived from the 
merge tree in Fig. The red box shows a clique, and a set 
of consistent labeling is shown. 

We train a classifier (Section [lll-Cb to predict the probability 
P{yi) for each merge ({vf, vp\vp}, (ejj, ejfe}). Then we 
score each clique Pi by associating it with energy with respect 
to its label 


T,(^,) = -logP(^,),^, = ±1. (4) 

Under the Markov assumption, we formulate our labeling 
problem as a constrained optimization problem 

imn Ei{yi),yi = ±1, 

VieY 

s.t. yi = +l,Vi,vf is a leaf, 

Vi < is parent to 

for which an inference algorithm will be introduced in Sec- 

tion inrDi 


C. Boundary classifier 

To score each clique, we train a boundary classifier to 
predict the probability of each merge. To generate training 
labels that indicate whether the boundary between two regions 
exists or not, we compare both the merge and the split case 
against the ground truth under certain error metric, such as 
the Rand error ED and the variation of information (VI) 14^ . 
1431 (See Section [\^ for details). The case with smaller error 
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(a) (b) (c) (d) 


Fig. 1. Examp le of (a) an initial segmentation, (b) a consistent final segmentation, (c) a merge tree, and (d) the corresponding conditional model factor graph 
(Section [lII-B^ with correct labeling. In (c), the leaf nodes have labels identical to those of the initial regions. The red nodes correspond to regions in the 
final segmentation. The red box in (d) indicates a clique in the model. 


deviates less from the ground truth and is adopted. In practice, 
we choose VI for its robustness to size rescaling 

Boundary features and region features are extracted for 
classification. For a pair of merging regions, boundary features 
provide direct cues about how it is likely the boundary truly 
exists, and regional features measure geometric and textural 
similarities between the two regions, which can both be 
informative to boundary classification. We choose features fol¬ 
lowing 1^ for comparison purposes. A summary of features 
is provided in Appendix The boundary classifier is not 
limited to any specific supervised classification model. We use 
random forest i44l in our experiments. The parameter setting 
for our random forest is summarized in Appendix 

The boundary classification problem is highly non-linear, 
and learning one universally good boundary classifier for all 
merging cases is essentially difficult. The size of merging 
regions affects the feature representativeness in classification. 
For instance, textural features in the form of averaged his¬ 
tograms among patches may not be informative when the 
merging regions are too small, because textural features can be 
extracted from only a very limited number of image patches 
and is thus noisy. On the other hand, when two regions are 
so big that they contain under-segmentation from different 
perceptual groups, the features again may not be meaningful, 
but for a different reason, that is, the histogram averaging is 
not able to represent the variation of textures. It is worth noting 
that for the same reason, different classifiers have to be learned 
at different merging stages in ||25l. 

We categorize the classification problem into sub-problems, 
train a separate sub-classifier for each sub-problem, and form 
the boundary classifier as an ensemble of sub-classifiers. We 
compute the median size |s|med of all regions observed in the 
training set and assign a category label to a training sample 
that involves regions Si and Sj based on their sizes as in ([^. 
Three sub-classifiers are then trained respectively using only 
samples with identical category labels. 


{ 1, if max(|s,|, \sj\) < |s|nied, 

2, if min(|sj|, |sj|) < |s|n,ed < max(|sj|, |sj|), 
3, otherwise. 


( 6 ) 

At testing time, a sample is categorized based on its 
region sizes and assigned to the corresponding sub-classifier 
for prediction. Since all the sub-classifiers are always used 
adjointly, we refer to the set of all sub-classifiers as the 
boundary classifier in the rest of this paper. 


D. Inference 

Exhaustive search to solve 0 has exponential complexity. 
Given the tree structure, however, we can use a bottom- 
up/top-down algorithm to efficiently find the exact optimal 
solution under the region consistency constraint. The funda¬ 
mental idea of the bottom-up/top-down algorithm is dynamic 
programming: in the bottom-up step, the minimum energies for 
both decisions (merge/split) under the constraint are kept and 
propagated from leaves to the root, based on which the set of 
best consistent decisions is made from the root to leaves in the 
top-down step. It is noteworthy that our bottom-up/top-down 
algorithm is only for inference and conceptually different from 
the top-down/bottom-up framework in isi, which seeks to 
combine high-level semantic information and low-level image 
features. On the other hand, the two-way message passing 
algorithm used in iZTl and our algorithm both belong to 
the Pearl’s belief propagation na, ill category, except that 
our inference algorithm explicitly incorporates the consistency 
constraint into the optimization procedure. 

In the bottom-up step, a pair of energy sums are kept track 
of for each node vf with children and the merging 
energy of node vf and its descendants all being labeled 
-1-1 (merge), the splitting energy Ef of it that vf is labeled 
— 1 (split), and its descendants are labeled optimally subject to 
the constraint. Then the energies can be computed bottom-up 
recursively as 

E- = E^ + E- + Eiivi = +1), (7) 

Et = min(E™, E]) + mm{E^, El) + Ei{yi = -1). (8) 

For leaf nodes, we assign Ef^ = 0 and Ef = oo to 
enforce their being labeled +1. Fig. [^illustrates the bottom-up 
algorithm in pseudocode. 

In the top-down step, we start from the root and do a depth- 
first search: if the merging energy of a node is lower than its 
splitting energy, label this node and all its descendants +1; 
otherwise, label this node —1 and search its children. Fig. 
illustrates the top-down algorithm in pseudocode. 

Eventually, we select the set of the nodes, such that its label 
is +1 and its parent is labeled —1, to form an optimal final 
segmentation. In both algorithms, each node is visited exactly 
once with constant operations, and we need only linear space 
proportional to the number nodes for Te and Y, so the time 
and space complexity are both 0(|V|). 
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Input: A list of energy for each clique pi 

Output: A list of energy tuples Te = ^ 

1: Te ^ 

ComputeEnergyTuples(i;^), where is the root node 
/* Helper function that recursively computes energies */ 

function ConiputeEnergyTuples(^f): 

if vf is a leaf node then 
Te ^ Te ^ {(0, oo)} 

return (0, oc) 
end if 

^ ConiputeEnergyTuples('z;^+^'' 


3 

(E'jf.E^) ^ ConiputeEnergyTuples(i;^^^) 


E- 

Ef 


^ Ef + E\ 


k + Ti{yi — + 1 ) 


mm{Ef. 
return {Ef^Ef 
end function 




■ min(E^, E|) + Ei{yi = -1) 


Fig. 2. Pseudocode of the bottom-up energy computation algorithm. 


Input: A list of energy tuples Te = {{Ef, 

Output: A complete label assignment Y = [yi^^i 
l: Y ^ {} 

AssignNodeLabels(i;^), where is the root node 
/* Helper function that recursively decides node labels */ 

function AssignNodeLabels(^f): 
if Ef < Ef then 

y ^ y U = +1} U {yi> = +1 | Mi' G V{i)}, 
where V{i) is set of indices of descendants of vf 

else 

y ^yu{^, = -l} 


AssignNodeLabels(i;^^^) 
AssignNodeLabels {vff ^) 


end if 

end function 


Fig. 3. Pseudocode of the top-down label assignment algorithm. 


IV. Iterative hierarchical merge tree model 


The performance upper bound of the hierarchical merge tree 
model is determined by the quality of the tree structure. If all 
true segments exist as nodes in the tree, they may be picked 
out by the inference algorithm using predictions from well- 
trained boundary classifiers. However, if a desirable segment 
is not represented by any node in the tree, the model is not able 
to recover the segment. Hence, the merging saliency function, 
which is used to determine merging priorities, is critical to the 
entire performance. With a good merging saliency function, we 
can push the upper bound of performance and thus improve 
segmentation accuracy. 

Statistics over the boundary strengths can be used to indicate 
merging saliency. We use the negated median of boundary 
pixel strengths as the initial representation of saliency, as 
mentioned in Section |III-A[ Since a boundary classifier is 
essentially designed to measure region merging likelihood, 
and it has advantages over simple boundary statistics because 
it takes various features from both boundary and regions, we 
propose to use the merging probabilities predicted by boundary 


classifiers as the merging saliency to construct a merge tree. 

As described in Section |III-C[ the training of a boundary 
classifier requires samples generated from a merge tree, but 
we would like to use a boundary classifier to construct a 
merge tree. Therefore, we propose an iterative approach that 
alternately collects training samples from a merge tree for 
the training of boundary classifiers and constructs a merge 
tree with the trained classifier. As illustrated in Fig. we 
initially use the negated median of boundary strengths to 
construct a merge tree, collect region merging samples, and 
train a boundary classifier ff Then, the boundary classifier 
is used to generate a new merge tree from the same 
initial superpixels So, from which new training samples are 
generated. We next combine the samples from the current 
iteration and from the previous iterations, remove duplicates, 
and train the next classifier ff . This process is repeated for T 
iterations or until the segmentation accuracy on a validation 
set no longer improves. In practice, we fix the iteration number 
to T = 10 for all data sets. Eventually, we have a series of 
boundary classifiers training iteration. The 

training algorithm is illustrated in Fig. 

At testing time, we take the series of trained classifiers and 
iterate in a way similar to the training process, as shown in 
Fig. 1^ at each iteration t, we take the previous boundary 
classifier fl~^ to construct a merge tree over the same initial 
superpixels So and use the current classifier to predict 
each merge score in the merge tree, based on which a 
final segmentation S^ is inferred. Finally, we transform each 
segmentation into a binary closed contour map by assigning 
boundary pixels 1 and others 0 and average them for each 
image over all iterations to generate a segmentation hierarchy 
in the form a real-valued contour map. The testing algorithm 
is illustrated in Fig. 

The explanation for the iterative approach is two-fold. 
First, by collecting samples that were not seen in previous 
iterations, we can explore the merge sample space and in 
turn explore the space of merge trees generated by the 
classifiers trained using the augmented sample set towards 
the “correct” merge tree. Second, like a bagging algorithm, 
segmentation averaging through iterations tends to emphasize 
accurate boundaries by phasing out non-systematic errors due 
to incorrect tree structures or classifier mispredictions. The 
segmentation accumulation alleviates the difficulty of training 
one accurate classifier to generate segmentations by improving 
via averaging. 


V. Experiments 

We conduct experiments with two validation goals. First, we 
evaluate the performance of our hierarchical merge tree model 
with different combinations of settings. Second, we compare 
our method with other state-of-the-art methods. Our source 
code is available at https://github.com/tingliu/glia 


A. Setting 

We experiment with six publicly available data sets for 
image segmentation: 
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(b) 


Fig. 4. Illustrations of (a) the training and (b) the testing procedure of the iterative hierarchical merge tree model. Starting with the fixed initial superpixels 
(“Init Seg”), the first iteration uses boundary probability (“Pb”) statistics for merge tree generation, and the training procedure iteratively augments the training 
set by incorporating new samples from merge trees and trains a new boundary classifier (“BC”), which is used for merge tree generation from the same initial 
superpixels in the next iteration. At testing time, boundary probability statistics and boundary classifiers learned at each iteration are used to generate merge 
trees from the same initial superpixels, and each boundary classifier is used to score merge cliques in the previous iteration; segmentations are generated from 
each merge tree and accumulated to generate the final contour hierarchy. The black lines show the use of initial superpixels, the red lines show the use of 
boundary classifiers, and the blue lines show the flow of sample data collected from tree structures. 


Input: Original images boundary maps 

and iteration number T 
Output: Boundary classifiers 
1: Generate initial superpixels 
2: for t : 0,1,... ,T do 
3: if t == 0 then 

4: Generate from {Soi}f=i using {Pbi}^-^ 

5: else 

6: Generate from using 

7: end if 

8: Generate samples from 

9: Train /* using 

10: end for 

Fig. 5. Pseudocode of the iterative training algorithm. 


1) Berkeley Segmentation Data Set 300 (BSDS300) lITSl : 
200 training and 100 testing natural images of size 
481 X 321 pixels. Multiple ground truth segmentations 
are provided with different labeling of details. 

2) Berkeley Segmentation Data Set 500 (BSDS500) d: an 
extension of BSDS300 with 200 new testing images of 
the same size, with multiple ground truth segmentations 
for each image. 

3) MSRC Object Recognition Data Set (MSRC) (161: 591 
320 X 213 natural images with one ground truth per 
image. A cleaned-up version Ea is used, in which 
“void” regions are removed, and disconnected regions 
that belong to the same object class are assigned differ¬ 


Input: Original images boundary maps {Pbi}^^, 

and boundary classifiers 

Output: Hierarchical segmentation contour map 
1: Generate initial superpixels 
2: for t : 0,1,... ,T do 

3: if t == 0 then 

4: Generate from {Soi}f=i using {Pbi}f^^ 

5: else 

6: Generate from using 

7: end if 

8: Score merges with and infer segmentations 

9: Binarize to contour maps 

10: end for 

11: {C&={Ef=oCl/iT+l)}^^, 

Fig. 6. Pseudocode of the iterative testing algorithm. 


ent labels in a single image. 

4) PASCAL Visual Object Classes Data Set (VOC12) dl: 
1449 validation images with one ground truth per image 
for PASCAL VOC 2012 Challenge. The average image 
size is 496 x 360. We use the ground truth for object 
segmentation and treat the object boundary pixels as 
background. 

5) Stanford Background Data Set (SBD) ifTTl : 715 approx¬ 
imately 320 X 240 images of outdoor scenes with one 
ground truth per image. 

6) NYU Depth Data Set v2 (NYU) (HI: 1449 indoor 
scene images with one ground truth per image. Down- 
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sampled versions (320 x 240) i) are used with frame 
pixels cropped. Only RGB channels are used in our 
experiment; the depth maps are not used. 

In order to compare with the other state-of-the-art methods, 
we follow 1^ and train our boundary classifiers with the 200 
training images in BSDS300. Five ground truth segmentations 
are selected for each image in the order of increasing details 
as indicated by the number of true segments. The training and 
the testing are done for each detail level, and the results are 
combined into a segmentation hierarchy. In our performance 
evaluation of different configurations of the merge tree model, 
we test on the testing images in BSDS500. For comparisons 
with other methods, we test on all six data sets. 

Appendix summarizes the features used for boundary 
classification, most of which follows ED. Appendix pro¬ 
vides the parameters that we use in our hierarchical merge tree 
model experiments. 


B. Evaluation metrics 

Following m, we use the segmentation covering (m, the 
probabilistic Rand index 14^ . and the variation of informa¬ 
tion 1421 , ll43l for segmentation accuracy evaluation. Here, 
we summarize the three evaluation metrics. For more details, 
please refer to (Tl. 

The segmentation covering measures averaged matching 
between proposed segments with a ground truth labeling, 
defined as 


SC{S,Sg) 


E l Si I I Si n s j I 

—— max ^ 

1^1 ^3 ^^9 U Sj\ 


(9) 


where V is the set of all pixels in an image. It matches each 
proposed segment to a true segment, with which the proposed 
segment has the largest overlapping ratio, and computes the 
sum of such optimal overlapping ratios weighted by relative 
segment sizes. 

The Rand index, originally proposed in 1411 . measures 
pairwise similarity between two multi-label clusterings. It is 
defined as the ratio of the number of pixel pairs that have 
identical labels in S and Sg or have different labels in S and 
Sg, over the number of all pixel pairs. 

RI{S, 5,) = ^ ^ I {S{i) = S{j) A Sg{t) = Sg{j)) , 

I 2 j i<j 

( 10 ) 

where S{i) is the label of the ith pixel in S, and I(-) is an 
indicator function that returns 1 if the input condition is met or 
0 otherwise. The Rand error is sometimes used to refer 1 — RL 
The probabilistic Rand index is the Rand index averaged over 
multiple ground truth labelings if available. 

The variation of information measures the relative entropy 
between a proposed segmentation and a ground truth labeling, 
defined as 


VI{S, Sg) = H{S I 5,) + H{Sg I 5), (11) 

where H{S\Sg) and H{Sg\S) are conditional image en¬ 
tropies. Denote the set of all labels in S sls Cs and the set of 


TABLE I 

Results on BSDS500 using the constrained conditional model 

(CCM) EORMULATION OR GREEDY TREE MODEL (GREEDY) IsJ), (SS] IN 
COMBINATION WITH THE ENSEMBLE BOUNDARY CLASSIEIER (EC) OR 
SINGLE BOUNDARY CLASSIEIER (SC). THE SEGMENTATION COVERING 

(Covering), the probabilistic Rand index (PRI), and the 

VARIATION OE INEORMATION (VI) ARE REPORTED EOR OPTIMAL DATA 
SET SCALE (ODS) AND OPTIMAL IMAGE SCALE (OIS). 



Covering 

PRI 

VI 

HMT variant 

ODS 

OIS 

ODS 

OIS 

ODS 

OIS 

CCM+EC 

0.594 

0.607 

0.804 

0.809 

1.682 

1.556 

CCM+SC 

0.573 

0.581 

0.779 

0.781 

1.690 

1.617 

Greedy+EC 

0.587 

0.620 

0.821 

0.834 

1.737 

1.589 

Greedy+SC (351 

0.582 

0.601 

0.805 

0.812 

1.748 

1.639 


all labels in Sg as Csg, we have 


ms IS,) 


leCs, ^ 


( 12 ) 


where P{lg) is the probability that a pixel in Sg receives label 
Ig, and P{l,lg) is the joint probability that a pixel receives 
label I in S and label Ig in Sg. H{Sg | S') can be defined 
similarly by switching S and Sg in 

For each data set, segmentation results are evaluated at a 
universal fixed scale (ODS) for the entire data set and at a fixed 
scale per testing image (OIS), following (H. The evaluated 
numbers are averaged over all available ground truth labelings. 
As pointed out in 1^ . since we focus on region segmentation, 
the pixelwise boundary-based evaluations for contour detection 
results ini are not relevant, and we use only the region-based 
metrics. 


C. Ensemble single boundary classifier and constrained 
conditional model greedy tree model 

We evaluate the performance of using single (“SC”) or 
ensemble boundary classifiers (“EC”) (Section |III-C| ) with 
our hierarchical merge tree model. We also compare the 
proposed constrained conditional model (“CCM”) formulation 
and greedy tree model (“Greedy”) previously proposed in 1^ , 
Ea. The greedy tree model shares the same hierarchical 
merge tree structure and scores each tree node only based 
on local merges the node is involved with, based on which a 
subset of highest-scored nodes that conform with the region 
consistency constraint are greedily selected. The training is 
done using the 200 training images in BSDS300 as described 
in Section V-A and we show the testing results on the 200 
testing images in BSDS500 in Table |T| 

A comparison between the first two rows in Table |I| shows 
that using ensemble boundary classifiers outperforms using 
only a single boundary classifier among all metrics, which 
supports our claim that the classifier ensemble is better able 
to capture underlying merging characteristics of regions at 
different size scales. 

Comparing the first and the third row, we can see that CCM 
significantly outperforms the greedy model in terms of VI, 
which is preferred over the other metrics for segmentation 
quality evaluation 1^ . It appears that CCM is outperformed 
by the greedy tree model in terms of PRI, but this is because 




















TABLE II 


TABLE III 


Results on BSDS500 of hierarchical merge tree model with 

ITERATIVE SEGMENTATION ACCUMULATION. THE SEGMENTATION 

COVERING (Covering), the probabilistic Rand index (PRI), and 

THE VARIATION OF INFORMATION (VI) ARE REPORTED FOR OPTIMAL 
DATA SET SCALE (ODS) AND OPTIMAL IMAGE SCALE (OIS). 



Covering 

PRI 

VI 

Iteration 

ODS 

OIS 

ODS 

OIS 

ODS 

OIS 

0 

0.594 

0.607 

0.804 

0.809 

1.682 

1.556 

1 

0.601 

0.637 

0.825 

0.841 

1.661 

1.498 

2 

0.612 

0.654 

0.829 

0.853 

1.596 

1.432 

3 

0.618 

0.666 

0.834 

0.860 

1.564 

1.407 

4 

0.624 

0.671 

0.834 

0.864 

1.545 

1.391 

5 

0.624 

0.676 

0.836 

0.865 

1.544 

1.378 

6 

0.626 

0.678 

0.835 

0.867 

1.539 

1.374 

7 

0.628 

0.679 

0.835 

0.868 

1.532 

1.373 

8 

0.628 

0.679 

0.835 

0.869 

1.534 

1.370 

9 

0.628 

0.680 

0.835 

0.869 

1.530 

1.371 

10 

0.629 

0.679 

0.835 

0.869 

1.526 

1.375 


Results on BSDS500 with Gaussian white noise with different 

VARIANCES 0-2 OF HIERARCHICAL MERGE TREE MODEL WITH ITERATIVE 
SEGMENTATION ACCUMULATION. THE TRAINING USES NOISE-FREE 
BSDS300 TRAINING IMAGES. THE SEGMENTATION COVERING 
(Covering), the probabilistic Rand index (PRI), and the 

VARIATION OF INFORMATION (VI) ARE REPORTED FOR OPTIMAL DATA 
SET SCALE (ODS) AND OPTIMAL IMAGE SCALE (OIS). 



Covering 

PRI 

VI 


Iter. 

ODS 

OIS 

ODS 

OIS 

ODS 

OIS 

0.001 

0 

0.457 

0.459 

0.587 

0.587 

1.929 

1.917 

10 

0.617 

0.665 

0.836 

0.860 

1.576 

1.411 

0.01 

0 

0.343 

0.344 

0.394 

0.394 

2.278 

2.271 

10 

0.574 

0.606 

0.805 

0.815 

1.754 

1.581 


0.01 (Appendix E for superpixel generation. We keep all 
Other settings identical to the previous experiment and run 
the iterative testing (Fig. for T = 10 iterations. The 


both models are trained using the labels determined based on 
VI (Section III-C). We perform another experiment where both 
are trained using the labels determined based on the Rand 
index, and CCM outperforms the greedy model 0.829 vs. 
0.826 in terms of ODS PRI and 0.855 vs. 0.848 in terms 
of OIS PRI. 


The fourth row shows the results using our previous 
work 1^ . It is clear that the proposed constrained conditional 
model and ensemble boundary classifier are an improvement 
over our previous approach without including the iterative 
segmentation accumulation. 


results for the first iteration and the last iteration are shown in 
Table Ini] Comparing Table |I^ and the corresponding entries in 
Table |Tj we can see that when the input images are noisy, the 
performance from HMT that uses gPb boundary saliency to 
generate the merge trees are severely degraded. However, with 
the iterative approach, the HMT performance is significantly 
improved. This is because the iterative approach enables the 
use of boundary classifiers that utilize different cues for better 
merge tree generation than using only boundary detection 
saliency under the noisy setting. In addition, the iterative 
segmentation accumulation stabilizes the HMT performance 
for noisy inputs by smoothing out non-systematic errors. 


D. Non-iterative iterative segmentation accumulation 

We evaluate the performance of our hierarchical merge tree 
model with or without iterative segmentation accumulation 
(Section [TVj). The experimental setting follows the previous ex¬ 
periments in Section V-C The constrained conditional model 
formulation and ensemble boundary classifiers are adopted. 
Results at each iteration are shown in Table im 

We can see that despite occasional oscillations, the results 
are improved through iterations. The rate of improvement 
slows down as more iterations are included in the averaging 
process. More sophisticated ways of choosing segmentations 
to average over can be used, such as to average segmentations 
only from the iterations that achieve the top accuracy on 
some validation set. In our experiment, since we would like 
to compare our method with other methods, we keep the 
same setting for training and testing data sets and do not 
use a separate validation set. We fix the iteration number to 
T = 10 and only report the results from averaging all the 
segmentations. 

We also test how the iteration infiuences the robustness of 
our method to image noise. Gaussian white noise is added to 
the BSDS500 testing images. We experiment with different 
large noise variances = 0.001 and = 0.01, so that 
the noise is clearly observable, and the input images are con¬ 
siderably corrupted. The previous model learned with noise- 
free BSDS300 training images is then used for testing. We 
observe significant decrease in the strength of gPb boundary 
detection, so we lower the initial water level to 0.005 from 


E. Comparisons with other methods 

In this section, we compare our proposed iterative hierarchi¬ 
cal merge tree method (CCM ensemble boundary classifier 
-I- iteration, under name “HMT”) with various other state-of- 
the-art region segmentation methods and benchmarks ID,E5l, 
ESI, ESi, ESI, E 2 I . E 4 II in very recent years on the public 
data sets. The results are shown in Table |IV| Note that E2l 
generates a single segmentation instead of contour hierarchies 
for each image. The OIS evaluations are therefore essentially 
the same as the ODS results, so we exclude the OIS entries for 
the sake of clarity. Fig. [7] shows sample testing segmentation 
results for each data set. 

From Table we can see that our method is highly 
competitive and outperforms very recent state-of-the-art meth¬ 
ods on some data sets, including BSDS500, which is the 
most used data set for image segmentation evaluation. It is 
noteworthy that the generalization of our method is almost as 
good as ISCRA ESl by being trained only on BSDS (general 
natural photos) and achieving competitive results on the NYU 
data set (indoor scene photos). It is also worth pointing out 
that our hierarchical segmentation framework can be used 
in combination with other features that can better guide the 
boundary classification. For example, using the most recent 
piecewise fiat embedding (PFE) EH, we expect the results to 
be further improved in a manner similar to the results from 
“MCG” to “PFE-MCG” on BSDS500 in Table [Ivl 
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TABLE IV 

Results of different methods on (a) BSDS300, (b) BSDS500, (c) 
MSRC, (D) VOC12, (E) SBD, AND (f) NYU DATA SET. The 
SEGMENTATION COVERING (COVERING), THE PROBABILISTIC RAND 
INDEX (PRI), AND THE VARIATION OF INFORMATION (VI) ARE REPORTED 
FOR OPTIMAL DATA SET SCALE (ODS) AND OPTIMAL IMAGE SCALE 

(OIS). 



BSDS300 

Covering 

PRI 

VI 

Method 

ODS 

OIS 

ODS 

OIS 

ODS 

OIS 

gPb-OWT-UCM IjJ 

0.59 

0.65 

0.81 

0.85 

1.65 

1.47 

ISCRA (25] 

0.60 

0.67 

0.81 

0.86 

1.61 

1.40 

HOCC |22| 

0.60 

- 

0.81 

- 

1.74 

- 

MCG (23] 

0.61 

0.67 

0.81 

0.86 

1.55 

1.37 

HMT 

0.61 

0.67 

0.82 

0.86 

1.58 

1.40 


(a) 



BSDS500 

Covering 

PRI 

VI 

Method 

ODS 

OIS 

ODS 

OIS 

ODS 

OIS 

gPb-OWT-UCM (T] 

0.59 

0.65 

0.83 

0.86 

1.69 

1.48 

ISCRA (25] 

0.59 

0.66 

0.82 

0.86 

1.60 

1.42 

GALA (26] 

0.61 

0.67 

0.84 

0.86 

1.56 

1.36 

HOCC (13 

0.60 

- 

0.83 

- 

1.79 

- 

DC (49] 

0.59 

0.64 

0.82 

0.85 

1.68 

1.54 

MCG (13 

0.61 

0.66 

0.83 

0.86 

1.57 

1.39 

PFE-mPb (H 

0.62 

0.67 

0.84 

0.86 

1.61 

1.43 

PFE-MCG dK 

0.62 

0.68 

0.84 

0.87 

1.56 

1.36 

HMT 

0.63 

0.68 

0.84 

0.87 

1.53 

1.38 


(b) 



MSRC 

Covering 

PRI 

VI 

Method 

ODS 

OIS 

ODS 

OIS 

ODS 

OIS 

gPb-OWT-UCM [il 

0.65 

0.75 

0.78 

0.85 

1.28 

0.99 

ISCRA f25l 

0.67 

0.75 

0.77 

0.85 

1.18 

1.02 

HMT 

0.67 

0.77 

0.79 

0.86 

1.23 

0.93 


(c) 


tree model, that limits the search space to one that is induced 
by tree structures and thus linear with respect to the number of 
initial superpixels. The framework allows the use of various 
merging saliency heuristics and features, and its supervised 
nature grants its capability of learning complex conditions 
for merging decisions from training data without the need 
for parameter tuning or the dependency on any classification 
model. Globally optimal solutions can be efficiently found 
under constraints to generate final segmentations thanks to the 
tree structure. 

We also introduce a modification to the hierarchical merge 
tree model that iteratively trains a new boundary classifier 
with accumulated samples for merge tree construction and 
merging probability prediction and accumulates segmentation 
to generate contour maps. 

For further improvement, the combination of merge trees 
from each iteration as one single model and its global res¬ 
olution can be investigated. Furthermore, it would be inter¬ 
esting to study the application of our method to semantic 
segmentation with the introduction of object-dependent prior 
knowledge. 

Acknowledgment 

This work was supported by NSF IIS-1149299 and NIH 
1R01NS075314-01. We thank Zhile Ren at Brown University 
for providing the ISCRA results on the testing data sets. 
We also thank the editor and the reviewers whose comments 
greatly helped improved this paper. 

Appendix A 

Summary oe boundary classifier features 



VOC12 

Covering 

PRI 

VI 

Method 

ODS 

OIS 

ODS 

OIS 

ODS 

OIS 

gPb-OWT-UCM (iJ 

0.46 

0.59 

0.76 

0.88 

0.65 

0.50 

ISCRA (H 

0.50 

0.58 

0.69 

0.75 

1.01 

0.93 

HMT 

0.49 

0.63 

0.77 

0.91 

0.60 

0.44 


(d) 



SBD 

Covering 

PRI 

VI 

Method 

ODS 

OIS 

ODS 

OIS 

ODS 

OIS 

gPb-OWT-UCM [11 

0.58 

0.64 

0.86 

0.89 

1.88 

1.62 

ISCRA (25) 

0.62 

0.68 

0.87 

0.90 

1.73 

1.49 

HMT 

0.61 

0.67 

0.86 

0.90 

1.72 

1.48 


(e) 



NYU 

Covering 

PRI 

VI 

Method 

ODS 

OIS 

ODS 

OIS 

ODS 

OIS 

gPb-OWT-UCM (ij 

0.55 

0.60 

0.90 

0.92 

1.89 

1.89 

ISCRA (B1 

0.57 

0.62 

0.90 

0.92 

1.82 

1.63 

HMT 

0.57 

0.61 

0.90 

0.92 

1.83 

1.66 


(f) 


VI. Conclusion 

Exhaustive search for optimal superpixel merging in image 
region segmentation is intractable. We propose a hierarchical 
image segmentation framework, namely the hierarchical merge 


We use 55 features from region pairs to train the boundary 
classifiers, including: 

1) Geometry (5-dimensional): Areas of two regions nor¬ 
malized by image area and perimeters and boundary 
length of two regions normalized by length of the image 
diagonal. 

2) Boundary (4-dimensional): Means and medians of 
boundary pixel intensities from gPb and UCM HI. 
Boundary detector gPb generates probability maps that 
describe how likely each pixel belong to an image 
boundary. UCM is the result from post-processing gPb 
probability maps that depicts how boundary pixels con¬ 
tribute to contour hierarchies in images. The boundary 
pixels follow the definition in 

3) Color (24-dimensional): Absolute mean differences, Li 
and distances and absolute entropy differences be¬ 
tween histograms (10-bin) of LAB and HSV compo¬ 
nents of original images. 

4) Texture (8-dimensional): Li and distances between 
histograms of texton 1501 (64-bin) and SIFT 15T1 dictio¬ 
nary of 256 words. The SIFT descriptors are computed 
densely, and 8x8 patches are used on gray. A, and B 
channel of original images. 

5) Geometric context (14-dimensional): Li and x^ dis¬ 
tances between histograms (32-bin) of the probability 
maps of each of the seven geometric context labels. 




























































































10 



Fig. 7. Testing segmentation results on BSDS300 (top-left), BSDS500 (top-right), MSRC (middle-left), SBD (middle-right), VOC12 (bottom-left), and NYU 
(bottom-right) data set. For each image, from top to bottom: the original image, the hierarchical contour map, the ODS covering segmentation, and the OIS 
covering segmentation. The training uses BSDS300 training images. 
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The geometric context labels indicate orientations of the 
surfaces in the images, which are predicted by a fixed 
pre-trained model provided by |[52l . 

Appendix B 

Summary of parameters 

We use the watershed algorithm for superpixel generation, 
for which the water level needs to be specified. In general, 
lowering the water level reduces under-segmentation by pro¬ 
ducing more superpixels, which gives us sets of high-precision 
superpixels to start with, but also increases the computation 
cost. We fixed the water level at 0.01 for all five datasets 
(BSDS300/500, MSRC, SBD, and VOC12), except the NYU 
data set. For the NYU data set of indoor scene images, we 
observe the decrease in gPb boundary detection strength, so 
we lower the water level to 0.001. We also pre-merge regions 
smaller than 20 pixels to their neighboring regions with the 
lowest boundary barrier, i.e. the median of boundary detection 
probabilities on the boundary pixels between the two regions. 

We train 255 fully grown decision trees for the random 
forest boundary classifier. To train each decision tree, 70% of 
training samples are randomly drawn and used. The number 
of features examined at each node is the square root of the 
total number of features ([v^J = 7). In the experiments, 
the training data are usually imbalanced. The ratios between 
the number of positive and negative samples are sometimes 
considerably greater than 1. Therefore, we assign to each class 
a weight reciprocal to the number of samples in the class to 
balance the training. 

We fix the number of iterations to T = 10 for all data sets 
for our iterative hierarchical merge tree model. 
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